We study tech stock prices through data visualization and financial indicators, focusing on those intended to provide reliable predictions that brokers can use to decide when it is best to buy or sell. We first analyze a year of data for some of the biggest tech companies (Amazon, Google, Apple and Microsoft) and then focus on Google's stock. Finally we move from financial tools to machine learning techniques such as linear regression. We first apply it to the last 6 years of Google Trends data for the word 'google', restricted to the financial-news domain, against the last 6 years of Google stock prices. We then make a similar prediction based on Twitter sentiment analysis over a short time span. Finally we compute predictions from multivariate input, i.e. we use other stock prices to fit first a multivariate linear regression and then an SVM.
keywords : Finance, Stock Price Analysis, MACD, Machine Learning, Linear Regression, SVM, Data Visualization
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# For reading stock data from yahoo or google
from pandas_datareader.data import DataReader  # pandas.io.data was moved out of pandas
# For time stamps
from datetime import datetime
# suppressing warnings
import warnings
warnings.filterwarnings('ignore')
# interactive plots
import chart_studio.plotly as py   # formerly plotly.plotly
import cufflinks as cf
import chart_studio.tools as tls   # formerly plotly.tools
tls.set_credentials_file(username='affinito', api_key='m9u9j6y55e')
#from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode()
# The tech stocks we'll use for this analysis
tech_list = ['AAPL','GOOG','MSFT','AMZN']
# Set up End and Start times for data grab
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)
# For loop for grabbing finance data and setting each ticker as a DataFrame
for stock in tech_list:
    # Set the DataFrame as the stock ticker
    globals()[stock] = DataReader(stock, 'google', start, end)
#AAPL.to_csv("files/apple.csv")
GOOG.iloc[:10, :-8]
The most recent Google closing values are missing!
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')
GOOG['Close'].iplot(kind='scatter', title="Google stocks closing prices", yTitle='Price',
theme='white', bestfit=True)
GOOG['Volume'].iplot(title='Google stocks volume', theme='pearl', yTitle='Volume', bestfit=False )
Let's look at the moving averages of the Google stock price:
mov_avg = [10, 20, 30]
for m in mov_avg:
    GOOG[str(m) + " MA"] = GOOG['Close'].rolling(m).mean()
GOOG.tail()
Moving averages versus the real closing value, zoomed in on the last 100 market days.
GOOG[['Close','10 MA','20 MA','30 MA']][-100:].iplot(theme='white')
The exponential moving average, by contrast, gives more weight to the latest data, i.e. it reacts faster to recent price changes thanks to the exponentially decreasing weights assigned to older data.
So let's look at the difference over a 26-day window (we'll see later why 26):
window = 26
GOOG[str(window)+' MA'] = GOOG['Close'].rolling(window).mean()
ema_center = (window - 1) / 2  # center of mass equivalent to a 26-day span
GOOG[str(window)+' EMA'] = GOOG['Close'].ewm(com=ema_center).mean()
GOOG[['Close',str(window)+' MA', str(window)+' EMA']].iplot(theme='white')
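As a sanity check on what `ewm` actually computes, here is a minimal sketch (on a small made-up price series, not on GOOG) comparing pandas' EMA against the explicit exponentially-decaying-weights formula:

```python
import numpy as np
import pandas as pd

# With adjust=True (the default), the EMA at step t is a weighted average
# with weights (1 - alpha)^k for k = 0..t, where alpha = 1 / (1 + com).
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])
com = 12.5                    # same center of mass used for the 26-day window
alpha = 1.0 / (1.0 + com)

ema_pandas = prices.ewm(com=com).mean()

# Recompute the last value by hand from the explicit weight formula:
# the newest observation gets weight 1, older ones decay geometrically
weights = (1 - alpha) ** np.arange(len(prices))[::-1]
ema_manual = (prices.values * weights).sum() / weights.sum()

assert np.isclose(ema_pandas.iloc[-1], ema_manual)
```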
Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of prices. The MACD is calculated by subtracting the 26-day exponential moving average (EMA) from the 12-day EMA. A nine-day EMA of the MACD, called the "signal line", is then plotted on top of the MACD, functioning as a trigger for buy and sell signals.
GOOG['12 EMA'] = GOOG['Close'].ewm(com=(12 - 1) / 2).mean()
GOOG['MACD'] = GOOG['12 EMA'] - GOOG['26 EMA']
#GOOG[['Close','MACD','26 MA']].iplot(theme='white')
GOOG[['Close','26 MA','MACD']].iplot( subplots=True, shape=(3,1), shared_xaxes=True,
title='Comparison between prevision power of MA and MACD', size=10 )
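The signal line mentioned above (the 9-day EMA of the MACD) is never computed in this notebook, so here is a sketch of it on a synthetic random-walk price series, together with the classic, admittedly simplistic, crossover rule; with live data you would substitute `GOOG['Close']` for `close`:

```python
import numpy as np
import pandas as pd

# Synthetic price series standing in for GOOG['Close']
rng = np.random.RandomState(0)
close = pd.Series(500 + rng.randn(120).cumsum())

# MACD = 12-day EMA minus 26-day EMA; signal line = 9-day EMA of the MACD
ema12 = close.ewm(span=12).mean()
ema26 = close.ewm(span=26).mean()
macd = ema12 - ema26
signal = macd.ewm(span=9).mean()

# A common (simplistic) trading rule: buy when the MACD crosses above the
# signal line, sell when it crosses below
cross = np.sign(macd - signal).diff()
buy_days = cross[cross > 0].index
sell_days = cross[cross < 0].index
print("buy signals:", len(buy_days), "sell signals:", len(sell_days))
```

Note that `span=12` is equivalent to the `com=(12-1)/2` used above, since com = (span − 1) / 2.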
Now it's time to study the stock's daily return, which, similarly to the ROI, is used to evaluate the efficiency of an investment or to compare the efficiency of different investments: the benefit of an investment is divided by its cost, and the result is expressed as a percentage.
GOOG['Daily return'] = GOOG['Close'].pct_change()
GOOG['Daily return'].iplot(theme='white', yTitle='daily benefit')
What happened on the 17th of July?
Google gains billions in value as YouTube drives ad growth : Google Inc's (GOOGL.O) shares closed up 16.3 percent at \$699.62 on Friday, adding about \$65 billion to its market value, as strong growth in YouTube viewership eased investor concerns about Facebook Inc's (FB.O) push into video.
It’s official: Google books biggest day in history, adding \$66.9B : Google Inc.’s stock closed at a record \$699.62 on Friday, delivering \$66.9 billion to investors in one day, a record for Wall Street. Shares of Google (GOOGL, GOOG) skyrocketed 16.3% on Friday, the company’s biggest one-day percentage gain since April 2008. The increase raised Google’s market capitalization by \$66.9 billion to \$478 billion, according to FactSet.
Great, now let's get an overall look at the daily returns using a histogram, normalized to show the percentage of days with a given daily return.
GOOG['Daily return'].iplot(theme='white', kind='hist', histfunc='count', histnorm='percent')
Now let's compare all the closing prices of the different stocks
closing_df = DataFrame(data=[GOOG['Close'], AAPL['Close'], MSFT['Close'], AMZN['Close']])
closing_df = closing_df.transpose()
closing_df.columns = ['Google','Apple','Microsoft','Amazon']
# making the avg daily return dataframe
daily_returns_df = closing_df.pct_change()
daily_returns_df.head()
Risk analysis refers to the uncertainty of forecasted future cash flows streams, variance of portfolio/stock returns, statistical analysis to determine the probability of a project's success or failure, and possible future economic states.
One way to quantify risk, using the information we've gathered on daily percentage returns, is to compare the expected return with the standard deviation of the daily returns.
risk_df = DataFrame( [daily_returns_df.dropna().mean(), daily_returns_df.dropna().std()] )
risk_df = risk_df.transpose()
risk_df.columns = ['mean','std']
#risk_df.index = ['mean','std']
risk_df
daily_returns_df.iplot(theme='white', kind='box', yTitle='Risk')
Now, referring to the Google daily return histogram above, let's get the amount of risk for the stock:
daily_returns_df['Google'].quantile(0.05)
The 0.05 empirical quantile of daily returns is at -0.023. That means that with 95% confidence, our worst daily loss will not exceed 2.3%.
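This quantile is exactly the historical one-day, 95% Value at Risk. A minimal sketch on synthetic returns, standing in for `daily_returns_df['Google']`:

```python
import numpy as np
import pandas as pd

# Historical VaR sketch: the 0.05 empirical quantile of daily returns bounds
# the worst daily loss with 95% confidence. Synthetic returns used here.
rng = np.random.RandomState(1)
returns = pd.Series(rng.normal(loc=0.0005, scale=0.012, size=250))

var_95 = returns.quantile(0.05)
print("95% one-day VaR: {:.3%}".format(var_95))

# Sanity check: roughly 5% of the observed days fall below the quantile
frac_below = (returns < var_95).mean()
```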
Now let's check whether there's any linear correlation between these stocks. We can do this through a Seaborn pairplot, which lets us look at it in terms of the Pearson product-moment correlation coefficient.
Recall that the expectation of a product equals the product of the expectations only when the variables are uncorrelated; the covariance measures exactly this discrepancy.

We'll discover two interesting high correlations.
sns.pairplot(daily_returns_df.dropna(), size=2.6)
I haven't found any paper or piece of news about this. Just a coincidence? Or maybe Apple and Google can be seen as major hub nodes of the highly connected (scale-free) stock market network [Statistical analysis of financial networks], and are therefore highly correlated with many other stocks?
sns.jointplot('Google','Amazon', daily_returns_df, kind='scatter', color='seagreen')
sns.jointplot('Apple','Microsoft', daily_returns_df, kind='scatter', color='seagreen')
daily_returns_df = daily_returns_df.dropna()
# rowvar=0 -> each column is a variable, each row an observation
daily_corr = np.corrcoef(daily_returns_df[['Apple','Microsoft']].values, rowvar=0)
corr_min = daily_corr.min()
corr_max = daily_corr.max()
Next, in a heatmap trace object, we override the default minimum and maximum color-scale values (with 'zmin' and 'zmax' respectively) to tighten the color bar's range.
from plotly.graph_objs import *
corr_trace = Heatmap(
    z=daily_corr,             # correlation as color contours
    x=['Apple','Microsoft'],  # variables on both
    y=['Apple','Microsoft'],  # axes
    zauto=False,              # (!) overwrite Plotly's default color levels
    zmin=corr_min,            # (!) set value of min color level
    zmax=corr_max,            # (!) set value of max color level
    colorscale='YlOrRd',      # light yellow-orange-red colormap
    reversescale=True         # inverse colormap order
)
heatmap_title = "Apple-Microsoft yearly correlation heatmap from "+GOOG.index[0].strftime("%d/%m/%y")
# Make layout object
layout = Layout(
    title=heatmap_title,  # set plot title
    autosize=False,       # turn off autosize
    height=500,           # plot's height in pixels
    width=800,            # plot's width in pixels
    margin=Margin(l=130)  # adjust left margin
)
corr_data = Data([corr_trace])
fig = Figure( data=corr_data, layout=layout)
py.iplot(fig, filename='apple-microsoft-correlation-heatmap')
# remember you can zoom in on the interactive figure!
Stock price forecasting is a popular and important topic in financial and academic studies. Time series analysis is the most common and fundamental method used to perform this task. This paper aims to combine the conventional time series analysis technique with information from the Google Trends website and the Yahoo Finance website to predict weekly changes in stock price.
Important news/events related to a selected stock over a five year span are recorded and the weekly Google trend index values on this stock are used to provide a measure of the magnitude of these events. The result of this experiment shows significant correlation between the changes in weekly stock prices and the values of important news/events computed from the Google trend website. The algorithm proposed in this paper can potentially outperform the conventional time series analysis in stock price forecasting.
[Stock Price Forecasting Using Information from Yahoo Finance and Google Trend - Selene Yue Xu]
google_trends contains the number of times the word "google" was searched in Google News per week. Before fitting the linear regression some criteria should be met, for instance the linear relationship between the two datasets cannot be null. So we plot the percentage change of the Google Trends series against the percentage change of the Google stock price. What we expect is very little correlation, since no study has found a significant correlation for these samples.
end = datetime.now()
start = datetime(end.year - 6, end.month, end.day)
GOOG = DataReader('GOOG','google', start, end)
GOOG['Daily return'] = GOOG['Close'].pct_change()
print ( "Number of values : "+str(GOOG.size))
GOOG.head()
google_trends = pd.read_csv("files/trendGoogle_financeNews_2010-2016.csv")
# 319 values, but we need to divide them into 5-day weeks
google_trends = google_trends[5:-3] # to have symmetric starts
google_trends.columns= ['Date','Value']
google_trends.index = pd.Series( range(google_trends.index.size ) ) # google_trends.index.size = 310
google_trends[['Value']] = google_trends[['Value']].astype(int)
google_trends.head()
Dividing the values into 5-day weeks and averaging:
num_of_groups = len(GOOG['Daily return']) // 5  # integer division, np.split needs whole groups
drw_mean = np.array([0.0] * num_of_groups)
i = 0
# drop the remainder days so the array splits evenly into 5-day weeks
for v in np.split(GOOG['Daily return'].values[:num_of_groups * 5], num_of_groups):
    drw_mean[i] = v.mean()
    i += 1
print(" Num. of weeks available: "+str(num_of_groups))
print( drw_mean[:10] )
Now we can create the dataframe containing the two percentage series to plot on a correlation figure:
# let's generate the new dataframe
#trend_price_df = DataFrame( [google_trends['Value'][:300].pct_change(), drw_mean] )
#trend_price_df = trend_price_df.transpose()
#trend_price_df.columns = ['trend pct','stock pct']
trend_price_df = pd.read_csv("files/trend_price.csv", index_col=0) # to avoid getting different data on future executions
#trend_price_df.to_csv("files/trend_price.csv")
trend_price_df = trend_price_df.dropna() # who needs null values ?
trend_price_df.head()
sns.jointplot('trend pct','stock pct', trend_price_df, kind='scatter', color='seagreen', size=7)
So, as we expected, the PPMCC is low (0.13), but it could still be interesting, although not significant, to attempt a prediction through linear regression.
To perform supervised learning we have to choose how to represent our hypothesis. In this case we approximate y, the vector of output values, as a linear function of x (called the input or feature):
$$h_\theta (x) = \theta_0 + \theta_1 x_{1}$$ where the θ_i are the parameters (or weights). This is the case of univariate linear regression, which is why there is just a single feature x_1.
To make a prediction we want to minimize the distance between the hypothesis h(x) and y. We can do this through a cost function, the least-squares cost function J(θ): $$ J(\theta) = \frac{1}{2} \sum_{i=1}^{m}( h_\theta(x_i) - y_i)^2 $$
To minimize this cost function an algorithm called gradient descent is used: at each iteration the function is differentiated with respect to each parameter, and a step is taken in the direction of steepest decrease of J.
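Although below we let numpy solve the least-squares problem directly, gradient descent itself can be sketched in a few lines; the data here are synthetic, with a known slope of 2 and intercept of 1:

```python
import numpy as np

# Gradient descent on the cost J(theta) defined above, for a univariate
# linear hypothesis h(x) = theta0 + theta1 * x, on synthetic data
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)

theta0, theta1 = 0.0, 0.0
lr = 0.1  # learning rate (step size)
for _ in range(2000):
    h = theta0 + theta1 * x
    # partial derivatives of J, averaged over the m samples
    grad0 = (h - y).mean()
    grad1 = ((h - y) * x).mean()
    theta0 -= lr * grad0
    theta1 -= lr * grad1

print(round(theta0, 2), round(theta1, 2))  # both should be close to 1.0 and 2.0
```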
Numpy has a built-in least-squares method in its linear algebra library. We'll use it first for our univariate regression and then move on to scikit-learn for our multivariate regression.
So we'll write our line as y = mx + b, but we need to put it in matrix form in order to use numpy.linalg.lstsq: y = Ap, where A = [x 1] and p = [m b].
import sklearn
from sklearn.linear_model import LinearRegression
#trend_price_df = trend_price_df.dropna()
X = np.array([ [value, 1] for value in trend_price_df['trend pct']])
Y = trend_price_df['stock pct']
m, b = np.linalg.lstsq(X, Y, rcond=None)[0]
Now that we have found out our linear coefficients, we can plot the line:
x = trend_price_df['trend pct']
plt.plot(x, Y,'o')
plt.plot(x, m*x + b, 'r', label='Best Fit Line')
Now it's time to compute the error of the least-squares fit: for each element it takes the difference between the fitted line and the true value, squares it, and sums all of these.
# Get the resulting array
result = np.linalg.lstsq(X, Y, rcond=None)
# Get the total error
error_total = result[1]
# Get the root mean square error
rmse = np.sqrt(error_total/len(X) )
# Print
print ("The root mean squared error was %.3f " %rmse)
Since the root mean square error (RMSE) corresponds approximately to the standard deviation of the residuals, we can now say that the stock's daily return won't deviate from the fitted line by more than 2 times the RMSE 95% of the time (68–95–99.7 rule).
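That 2×RMSE claim holds only for roughly normal residuals; a quick sketch checking it on simulated normal errors:

```python
import numpy as np

# For (approximately) normal residuals, about 95% of observations fall
# within 2 * RMSE of the fitted line (68-95-99.7 rule)
rng = np.random.RandomState(2)
residuals = rng.normal(scale=0.015, size=10000)

rmse = np.sqrt(np.mean(residuals ** 2))
coverage = np.mean(np.abs(residuals) < 2 * rmse)
print("coverage within 2*RMSE: {:.3f}".format(coverage))
```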
tech_list = ['AAPL','MSFT','AMZN','TWTR','GOOG']
# nasdaq ?
end = datetime.now()
start = datetime(end.year - 4, end.month, end.day)
stocks = []
for t in tech_list:
    stocks.append( DataReader(t, 'google', start, end) )
# TWTR!
# start 2013-11-07
stocks[4].size
import sklearn
from sklearn.linear_model import LinearRegression
X_stocks = DataFrame( [stocks[0]['Close'].pct_change(), stocks[1]['Close'].pct_change(),
stocks[2]['Close'].pct_change(), stocks[3]['Close'].pct_change() ] )
X_stocks = X_stocks.transpose()
X_stocks.columns = tech_list[:-1]
Y_target = stocks[4]['Close'].loc['2013-11-07':].pct_change()
X_stocks = X_stocks.dropna()[-500:]
Y_target = Y_target.dropna()[-500:]
lreg = LinearRegression()
lreg.fit(X_stocks, Y_target ) # fits a linear model
print (' The estimated intercept coefficient is %.5f ' %lreg.intercept_)
print (' The number of coefficients used was %d ' % len(lreg.coef_) )
coef_df = DataFrame( X_stocks.columns )
coef_df.columns = ['Feature']
coef_df["Coefficient Estimate"] = pd.Series( lreg.coef_ )
coef_df
Here the largest coefficient estimate corresponds to Microsoft's stock.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_stocks, Y_target)
# Print shapes of the training and testing data sets
print (X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
lreg.fit(X_train,Y_train)
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
print ("Fit a model X_train, and calculate MSE with Y_train: %.5f" % np.mean((Y_train - pred_train) ** 2) )
print ("Fit a model X_train, and calculate MSE with X_test and Y_test: %.5f" %np.mean((Y_test - pred_test) ** 2) )
The difference between the observed value of the dependent variable (y) and the predicted value (h(x)) is called the residual (e). Each data point has one residual, so residual = ObservedValue − PredictedValue.
These residuals are exactly the errors penalized by the cost function J(θ) we discussed earlier; in this case, however, multiple features are considered, so the hypothesis takes the more general form
$$ h_\theta (x) = \sum_{i=0}^{m}{\theta_i x_i}$$
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
# Scatter plot the training data
train = plt.scatter( pred_train, (pred_train-Y_train), c='b', alpha=0.5)
# Scatter plot the testing data
test = plt.scatter( pred_test, (pred_test-Y_test), c='r', alpha=0.5, )
# Plot a horizontal axis line at 0
plt.hlines( y=0, xmin=-0.015, xmax=0.015)
#Labels
plt.legend((train,test),('Training','Test'), loc='lower left')
plt.title('Residual Plots')
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
The advantages of support vector machines are: they are effective in high-dimensional spaces, memory-efficient (only a subset of the training points, the support vectors, is used in the decision function), and versatile (different kernel functions can be specified).
The disadvantages of support vector machines include: they tend to perform poorly when the number of features is much greater than the number of samples, and they do not directly provide probability estimates.
This technique was developed in three major steps.
So, given training vectors x_i in ℝ^p, i = 1, ..., n, and a vector y in ℝ^n, ε-SVR solves the following primal problem:
$$ \min_{w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) $$
$$ \text{subject to } y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i, \quad w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*, \quad \zeta_i, \zeta_i^* \ge 0 $$
As the constraints in the primal form are not convenient to handle, one conventionally resorts to the dual problem, which is again a quadratic program but with much simpler constraints: box constraints plus a single linear equality constraint.
The training examples can be divided into three categories according to the value of α*_i. If α*_i = 0, the corresponding training example does not affect the decision boundary; in fact it lies beyond the margin, that is, y_i(⟨w, x_i⟩ + b) > 1.
If α*_i ∈ (0, C), the training example lies exactly on the margin.
If α*_i = C, the training example violates the margin.
In the latter two cases, where α*_i > 0, the i-th training example is called a support vector.
$$ \min_{\alpha, \alpha^*} \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon\, e^T (\alpha + \alpha^*) - y^T (\alpha - \alpha^*) $$
$$ \text{subject to } e^T (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C $$
Here e is the vector of all ones, C > 0 is the upper bound, and Q is an n by n positive semidefinite matrix with Q_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j), where K is the kernel.
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
'''
C Penalty parameter of the error term.
epsilon Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the
training loss function with points predicted within a distance epsilon from the actual value.
kernel Specifies the kernel type to be used in the algorithm. RBF = radial basis function.
gamma Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used
'''
svr_model = SVR(kernel='rbf', gamma='auto', C=1.0, epsilon=0.1, cache_size=500 )
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split( X_stocks, Y_target, train_size=0.7, random_state=42 )
print ("Subset shapes: X: {}, {}, y:{}, {}".format(X_train.shape, X_test.shape,Y_train.shape,Y_test.shape ))
# Fit the SVM model according to the given training data.
svr_model.fit( X_train.values.astype(float), Y_train.values.astype(float) )
X_train.values.astype(float)
np.shape(Y_train.values)
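The notebook fits the SVR but never evaluates it. Here is a sketch that closes the loop on synthetic data shaped like `X_stocks`/`Y_target` (four peer-stock daily returns predicting a fifth, with a hypothetical linear relation); note that the `epsilon=0.1` used above is large compared with typical daily returns (on the order of ±0.02), so for this sketch we standardize the features and shrink epsilon:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-ins for X_stocks (peer-stock returns) and Y_target
rng = np.random.RandomState(3)
X = rng.normal(scale=0.01, size=(500, 4))
true_w = np.array([0.2, 0.5, 0.1, 0.3])          # hypothetical linear relation
y = X @ true_w + rng.normal(scale=0.002, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=42)

# Standardize the features: with raw returns of scale ~0.01 the RBF kernel
# is nearly constant and the model degenerates to predicting the mean
scaler = StandardScaler().fit(X_train)
svr = SVR(kernel='rbf', gamma='auto', C=1.0, epsilon=0.001)
svr.fit(scaler.transform(X_train), y_train)

pred = svr.predict(scaler.transform(X_test))
mse = np.mean((pred - y_test) ** 2)
baseline = np.mean((y_train.mean() - y_test) ** 2)  # always-predict-the-mean
print("test MSE: {:.2e} (baseline: {:.2e})".format(mse, baseline))
print("support vectors:", len(svr.support_))
```

The number of support vectors reflects the epsilon tube: points whose residual exceeds epsilon become support vectors, so shrinking epsilon trades sparsity for accuracy.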
Regression trees ?